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Abstract 

Partly motivated by entropy-estimation problems in neuroscience, we present a 
detailed and extensive comparison between some of the most popular and effective 
entropy estimation methods used in practice: The plug-in method, four different es- 
timators based on the Lempel-Ziv (LZ) family of data compression algorithms, an 
estimator based on the Context-Tree Weighting (CTW) method, and the renewal 
entropy estimator. 

Methodology. Three new entropy estimators are introduced; two new LZ-based 
\q \ estimators, and the "renewal entropy estimator," which is tailored to data generated 

(*C) ■ by a binary renewal process. For two of the four LZ-based estimators, a bootstrap 

procedure is described for evaluating their standard error, and a practical rule of 
. thumb is heuristically derived for selecting the values of their parameters in practice. 

Theory. We prove that, unlike their earlier versions, the two new LZ-based esti- 
mators are universally consistent, that is, they converge to the entropy rate for every 
finite-valued, stationary and ergodic process. An effective method is derived for the 



> 



O 

^ ■ accurate approximation of the entropy rate of a finite-state HMM with known distri- 
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bution. Heuristic calculations are presented and approximate formulas are derived for 
evaluating the bias and the standard error of each estimator. 

Simulation. All estimators are applied to a wide range of data generated by nu- 
merous different processes with varying degrees of dependence and memory. The main 
conclusions drawn from these experiments include: (i) For all estimators considered, 
the main source of error is the bias, (ii) The CTW method is repeatedly and consis- 
tently seen to provide the most accurate results. (Hi) The performance of the LZ-based 
estimators is often comparable to that of the plug-in method, (iv) The main draw- 
back of the plug-in method is its computational inefficiency; with small word-lengths 
it fails to detect longer-range structure in the data, and with longer word-lengths the 
empirical distribution is severely undersampled, leading to large biases. 
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1 Introduction 

The problem of estimating the entropy of a sequence of discrete observations has received a 
lot of attention over the past two decades. A, necessarily incomplete, sample of the theory 
that has been developed can be found in [27][3][l5][IS][23][I^ 

and the references therein. Examples of numerous different applications are contained in 
the above list, as well as in [5] [7] p] 02] [H] 03] [S] [25] [M] [35] [2S] @] [28] . 

Information-theoretic methods have been particularly widely used in neuroscience, in 
a broad effort to analyze and understand the fundamental information-processing tasks 
performed by the brain. In many of these these studies, the entropy is adopted as the 
main measure for describing the amount of information transmitted between neurons. 
There, one of the most basic tasks is to identify appropriate methods for quantifying this 
information, in other words, to estimate the entropy of spike trains recorded from live 
animals; see, e.g., @2][5l][l3][35][26][ig 

Motivated, in part, by the application of entropy estimation techniques to such neu- 
ronal data, we present a systematic and extensive comparison, both in theory and via 
simulation, between several of the most commonly used entropy estimation techniques. 
This work serves, partly, as a more theoretical companion to the experimental work and 
results presented in [13j[12j[14j. There, entropy estimators were applied to the spike trains 
of 28 neurons recorded simultaneously for a one-hour period from the primary motor and 
dorsal premotor cortices (MI, PMd) of a monkey. The purpose of those experiments was 
to examine the effectiveness of the entropy as a statistic, and its utility in revealing some 
of the underlying structural and statistical characteristics of the spike trains. In contrast, 
our main aim here is to examine the performance of several of the most effective entropy 
estimators, and to establish fundamental properties for their applicability, such as rigorous 
estimates for their convergence rates, bias and variance. In particular, since (discretized) 
spike trains are typically represented as binary sequences [36], some of our theoretical 
results and all of our simulation experiments are focused on binary data. 

Section 2 begins with a description of the entropy estimators we consider. The simplest 
one is the plug-in or maximum likelihood estimator, which consists of first calculating the 
empirical frequencies of all words of a fixed length in the data, and then computing the 
entropy of this empirical distribution. For obvious computational reasons, the plug-in is 
ineffective for word-lengths beyond 10 or 20, and hence it cannot take into account any 
potential longer-time dependencies in the data. 

A popular approach for overcoming this drawback is to consider entropy estimators 
based on "universal" data compression algorithms, that is, algorithms which are known 
to achieve a compression ratio equal to the entropy, for data generated by processes which 
may possess arbitrarily long memory, and without any prior knowledge about the distri- 
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bution of the underlying process. Since, when trying to estimate the entropy, the actual 
compression task is irrelevant, many entropy estimators have been developed as modifica- 
tions of practical compression schemes. Section \2 . 31 describes two well-known such entropy 
estimators [22], which are based on the Lempel-Ziv (LZ) family of data compression al- 
gorithms [59j[60j. These estimators are known to be consistent (i.e., to converge to the 
correct value for the entropy) only under certain restrictive conditions on the data. We 
introduce two new LZ-based estimators, and we prove in Theorem 12.11 that, unlike the 
estimators in [22], they are consistent under essentially minimal conditions, that is, for 
data generated by any stationary and ergodic process. 

Section 12.41 contains an analysis, partly rigorous and partly in terms of heuristic com- 
putations, of the rate at which the bias and the variance of each of the four LZ-based 
estimators converges to zero. A bootstrap procedure is developed for empirically estimat- 
ing the standard error of two of the four LZ-based estimators, and a practical rule-of-thumb 
is derived for selecting the values of the parameters of these estimators in practice. 

Next, in Section 12.51 we consider an entropy estimator based on the Context-Tree 
Weighting (CTW) algorithm [53J [54J |52j. [In the neuroscience literature, a similar pro- 
ceedure has been applied in [18] and [26] .] The CTW, also originally developed for data 
compression, can be interpreted as a Bayesian estimation procedure. After a brief descrip- 
tion, we explain that it is consistent for data generated by any stationary and ergodic 
process and show that its bias and variance are, in a sense, as small as can beQ 

Section [3] contains the results of an extensive simulation study, where the various en- 
tropy estimators are applied to data generated from numerous different types of processes, 
with varying degrees of dependence and memory. In Section 13. 1[ after giving brief de- 
scriptions of all these data models, we present (Proposition 13. If) a method for accurately 
approximating the entropy rate of a Hidden Markov Model (HMM); recall that HMMs 
are not known to admit closed-form expressions for their entropy rate. Also, again partly 
motivated by neuroscience applications, we introduce another new entropy estimator, the 
renewal entropy estimator, which is tailored to binary data generated by renewal processes. 

Section 13.21 contains a detailed examination of the bias and variance of the four LZ- 
based estimators and the CTW algorithm. There, our earlier theoretical predictions are 
largely confirmed. Moreover, it is found that two of the four LZ-based estimators are 
consistently more accurate than the other two. 

Finally, Section 13.31 contains a systematic comparison of the performance of all of the 
above estimators on different types of simulated data. Incidental comparisons between 
some of these methods on limited data sets have appeared in various places in the literature, 
and questions have often been raised regarding their relative merits. One of the main goals 
of the work we report here is to offer clear resolutions for many of these issues. Specifically, 
in addition to the points mentioned up to now, some of the main conclusion that we draw 

1 The CTW also has another feature which, although important for applied statistical analyses such as 
those reported in connection with our experimental results in [12] [14]. will not be explored in the present 
work: A simple modification of the algorithm can be used to compute the maximum a posteriori probability 
tree model for the data; see Section \2. 51 for some details. 
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from the simulation results of Section [3] (these and more are collected in Section d|) can be 
summarized as follows: 

• Due its computational inefficiency, the plug-in estimator is the least reliable method, 
in contrast to the LZ-based estimators and the CTW, which naturally incorporate 
dependencies in the data at much larger time scales. 

• The most effective estimator is the CTW method. Moreover, for the CTW as well 
as for all other estimators, the main source of error is the bias and not the variance. 

• Among the four LZ-based estimators, the two most efficient ones are those with 
increasing window sizes, H n of [22] and H n introduced in Section 12.31 Somewhat 
surprisingly, in several of the simulations we conducted the performance of the LZ- 
based estimators appears to be very similar to that of the plug-in method. 

2 Entropy Estimators and Their Properties 

This section contains a detailed description of the entropy estimators that will be applied 
to simulated data in Section [3l After some basic definitions and notation in Section 12.11 
the following four subsections contain the definitions of the estimators together with a 
discussion of their statistical properties, including conditions for consistency, and estimates 
of their bias and variance. 

2.1 Entropy and Entropy Rate 

Let A be a random variable or random vector, taking values in an arbitrary finite set A, 
its alphabet, and with distribution p{x) = Pr{A = x} for x £ A. The entropy of A [8] is 
defined as, 

H(X) = H (p) = - p( x ) lo g V{%) , 
x<=A 

where, throughout the paper, log denotes the logarithm to base 2, log 2 - A random process 
X = {. . . , X—i, Xq, Xi,X2, . . .} with alphabet A is a sequence of random variables {A ra } 
with values in A. We write X\ = (Aj, Aj+i, ... ,Xj) for a (possibly infinite) contiguous 
segment of the process, with — oo < i < j < oo, and x\ = (xj,Xj + i, . . . ,Xj) for a specific 
realization of X\, so that xj is an element of A^ l+1 . The entropy rate H = H(X), or 
"per-symbol" entropy, of X is the asymptotic rate at which the entropy of A™ changes 
with n, 

H = H (X)= lim -H(X 1 ,X 2 ,...,X n ), (1) 

n— »oo n 

whenever the limit exists, where H(Xi, A 2 , . . . , X n ) is the entropy of the jointly distributed 
random variables A™ = (Ai, A 2 , . . . , X n ). Recall [8] that for a stationary process (i.e., 
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a process such that the distribution of every finite block X™+i of size k has the same 
distribution, say pk, independently of its position n), the entropy rate exists and equals, 



H = H(X)= lim H(X n \X n „ l ,...,X 2 ,X 1 ), 

n—>oc 

where the conditional entropy H{X n | X™ -1 ) = H(X n | X n -i, . . . , X2, X\) is defined as, 



H(X n I X?- 1 ) = ]T p n (x^) logPr{X„ = x n I Xf- 1 = x^ 1 } 

Pn(Xi) 



Pn{Xi)log- 



x n &A n Pn-lOl ) 

As mentioned in the introduction, much of the rest of the paper will be devoted to 
binary data produced by processes X with alphabet A = {0, 1}. 



2.2 The Plug-in Estimator 

Perhaps the simplest and most straightforward estimator for the entropy rate is the so- 
called plug-in estimator. Given a data sequence x™ of length n, and an arbitrary string, 
or "word," yf £ A w of length w < n, let p w (yf) denote the empirical probability of the 
word yf in x"; that is, p w {yf) is the frequency with which yf appears in x™. If the 
data are produced from a stationary and ergodico process, then the law of large numbers 
guarantees that, for fixed w and large n, the empirical distribution p w will be close to the 
true distribution p w , and therefore a natural estimator for the entropy rate based on ([I]) 
is: . 1 

-ffrwpiug-in = —H(p w ) = — Y] p w (yf) log p w (yf). 

W W * — ' 

This is the plug-in estimator with word-length w. Since the empirical distribution is also 
the maximum likelihood estimate of the true distribution, this is also often referred to as 
the maximum-likelihood entropy estimator. 

Suppose the process X is stationary and ergodic. Then, taking w large enough for 
—H(X'i) to be acceptably close to H in ([T]), and assuming the number of samples n is much 
larger than w so that the empirical distribution of order w is close to the true distribution, 
the plug-in estimator H n ,io,plug-in wm produce an accurate estimate for the entropy rate. 
But, among other difficulties, in practice this leads to enormous computational problems 
because the number of all possible words of length w grows exponentially with w. For 
example, even for the simple case of binary data with a modest word-length of w = 30, 
the number of possible strings yf is 2 30 , which in practice means that the estimator would 

2 Recall that ergodicity is simply the assumption that the law of large numbers holds in its general form 
(the ergodic theorem); see, e.g., [40] for details. This assumption is natural and, in a sense, minimal, in 
that it is hard to even imagine how any kind of statistical inference would be possible if we cannot even 
rely on taking long-term averages. 
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either require astronomical amounts of data to estimate p w accurately, or it would be 
severely undersampled. See also |31j and the references therein for a discussion of the 
undersampling problem. 

Another drawback of the plug-in estimator is that it is hard to quantify its bias and to 
correct for it. For any fixed word-length w, it is easy to see that the bias E [-ffn,w,plug-hi] — 
— H(Xf) is always negative pQ [30], whereas the difference between the wth-order per- 
symbol entropy and the entropy rate, -^H(Xf) — H(X), is always nonnegative. Still, 
there have been numerous extensive studies on calculating this bias and on developing 
ways to correct for it; see (33] [3D] [51] [35J [42J [28] and the references therein. 

2.3 The Lempel-Ziv Estimators 

An intuitively appealing and popular way of estimating the entropy of discrete data with 
possibly long memory, is based on the use of so-called universal data compression algo- 
rithms. These are algorithms that are known to be able to optimally compress data from 
an arbitrary process (assuming some broad conditions are satisfied), where optimality 
means that the compression ratio they achieve is asymptotically equal to the entropy rate 
of the underlying process - although the statistics of this process are not assumed to be 
known a priori. Perhaps the most commonly used methods in this context are based on a 
family of compression schemes known as Lempel-Ziv (LZ) algorithms; see, e.g., [59j[60j[56j. 

Since the entropy estimation task is simpler than that of actually compressing the 
data, several modified versions of the original compression algorithms have been proposed 
and used extensively in practice. All these methods are based on the calculation of the 
lengths of certain repeating patterns in the data. Specifically, given a data realization 
x = (. . . , X-i, xo, x\, X2, ■ ■ •), for every position i in x and any "window length" n > 1, 
consider the length £ of the longest segment xf"- x in the data starting at i which also 
appears in the window x l ~} n of length n preceding position i. Formally, define Lf = 
Lf(x) = L™(x-+™ _1 ) as 1+ [that longest match-length]: 

Lf = V}{x) = L^xtT 1 ) 

= 1 + max{0 < I < n : x l ^~ x = Xj^ 1 for some i — n < j < i — 1}. 

Suppose that the process X is stationary and ergodic, and consider the random match- 
lengths Lf = Lf(X^^~ 1 ). In [56][29] it was shown that, for any fixed position i, the 
match-lengths grow logarithmically with the window size n, and in fact, 

- — > — as n — > do, with probability 1, (2) 

logn H 

where H is the entropy rate of the process. This result suggests that the quantity 
(logn)/L" can be used as an entropy estimator, and, clearly, in order to make more 
efficient use of the data and reduce the variance, it would be more reasonable to look at 
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the average value of various match-lengths Lf taken at different positions i; see the dis- 
cussion in [22]. To that effect, the following two estimators are considered in [22]. Given 
a data realization x = x^^, a window length n > 1, and a number of matches k > 1, the 
sliding-window LZ estimator H n ^ = H n ,k( x ) = Hn^x 7 ^^ 1 ) is defined by, 



H, 



n,k 



1 k 
h E 



logn 



(3) 



Similarly, the increasing-window LZ estimator H n = H n (x) = H n (xQ n 1 ) is defined by, 



H,, 



1 n 

-E 

i=2 



log z 



(4) 



The difference between the two estimators in © and (j3|) is that H n ,k uses a fixed 
window length, while H n uses the entire history as its window, so that the window length 
increases as the matching position moves forward. 

In [22] it is shown that, under appropriate conditions, both estimators H n ± and H n 
are consistent, in that they converge to the entropy rate of the underlying process with 
probability 1, as n, k — > oo. Specifically, it is assumed that the process is stationary 
and ergodic, that it takes on only finitely many values, and that it satisfies the Doeblin 
Condition (DC). This condition says that there is a finite number of steps, say r, in the 
process, such that, after r time steps, no matter what has occurred before, anything can 
happen with positive probability: 

Doeblin Condition (DC). There exists an integer r > 1 and a real number 
> such that, 

Pv(X r = a\X° 00 ) > A 

for all a G A and with probability one in the conditioning, i.e., for almost all 
semi-infinite realizations of the past = (Xo,X_i, . . .). 

Condition (DC) has the advantage that it is not quantitative - the values of r and (3 
can be arbitrary - and, therefore, for specific applications it is fairly easy to see whether 
it is satisfied or not. But it is restrictive, and, as it turns out, it can be avoided altogether 
if we consider a modified version of the above two estimators. 

To that end, we define two new estimators H n ^ and H n as follows. Given x = x 00 ^, n 
and k as above, define the new sliding-window estimator H n ^ = H n ^(x) = H n) k(x r !^ l ^ 1 ) , 



H 



n,k 



1 k 

k E 



log n 



(5) 
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and the new increasing-window estimator H n = H n (x) = H n (x n ) as, 

~ l^Alogi 
n L- 

Below some basic properties of these four estimators are established, and conditions 
are given for their asymptotic consistency. Parts (i) and (iii) of Theorem 12.11 are new; 
most of part (ii) is contained in [22J. 

Theorem 2.1. [Consistency of LZ-type Estimators] 

(i) When applied to an arbitrary data string, the estimators defined in (J3J) — dSJ) always 
satisfy, 

H n ,k — H n ^ and H n < H n , 

for any n, k. 

(ii) The estimators H n ^ and H n are consistent when applied to data generated by a 
finite- valued, stationary, ergodic process that satisfies Doeblin's condition (DC). With 
probability one we have: 

H n k H, H n — ► H , as k,n —* oo. 

(iii) The estimators H n ^ and H n are consistent when applied to data generated by an 
arbitrary finite- valued, stationary, ergodic process, even if (DC) does not hold. With 
probability one we have: 

H n f. — > H, H n — * H , as k,n — » oo. 

Note that parts (ii) and (iii) do not specify the manner in which n and k go to infinity. 
The results are actually valid in the following cases: 

1. If the two limits as n and k tend to infinity are taken separately, i.e., first k — > oo 
and then n — » oo, or vice versa; 

2. If k — > oo and n = varies with k in such a way that n& — > oo as k — > oo; 

3. If n — > oo and k = k n varies with n in such a way that it increases to infinity as 

n — > oo; 

4. If k and n both vary arbitrarily in such a way that k stays bounded and n — > oo. 



Proof. Part (i). An application of Jensen's inequality to the convex function x i— ► 1/x, 
with respect to the uniform distribution (1/k, . . . , 1/k) on the set {1, 2, . . . , k}, yields, 

H nk = Y llo % n = Y l _\Jl_r l > \ylJl_]- 1 = H nk , 

n ' t—' k Lf ^ k Llog n\ L^-'A;lognJ 

i=l 1 i=l i=l 
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as required. The proof of the second assertion is similar. 

Part (ii). The results here are, for the most part, proved in [22], where it is established 
that H n — ► H and H nn — > H as n — > oo, with probability one. So it remains to show that 
-ffn,fc ~~ -H" as n ) —> 00 m each of the four cases stated above. 
For case 1 observe that, with probability 1, 



lim lim H n f. 

k n 



lim 

k 



1 L 

Elim 
n 



k 



i=l 



log n 



(a) 



lim 

k 



k^H 



(7) 



where (a) follows from ([2]). To reverse the limits, we define, for each fixed n, a new 
process {Y^ n) } = {..., Y"i ™ } , Y^ n) , r/ n) , y 2 (n) , . . .} by letting Y- (n) = Lf/(log n) for each i. 

Then the process {Y^} is itself stationary and ergodic. Recalling also from [22] that the 
convergence in ([2]) takes place not only with probability one but also in L 1 , we may apply 
the ergodic theorem to obtain that, with probability 1, 



lim lim H n & 

n k 



lim 

n 



k 



in) 



i=l 



(6) 



(«)m-i 



lim [E(Yrn] 

n 

H, 



limE[ 

n \ log n 



where (b) follows by the ergodic theorem and (c) from the L 1 version of ([2]). 

The proof of case 2 is identical to the case k = n considered in [22J. In case 3, 
since the sequence {k n } is increasing, the limit of Hk n , n reduces to a subsequence of the 
corresponding limit in case 2 upon considering the inverse sequence {n^}. 

Finally for case 4, recall from ([7]) that, 



\\mH, 



k.n 



H with prob. 1, 



for any fixed k. Therefore the same will hold with a varying k, as long as it varies among 
finitely many values. 

Part (iii). The proofs of the consistency results for H n and H n ^ can be carried out 
along the same lines as the proofs of the corresponding results in [22J, together with their 
extensions as in Part (ii) above. The only difference is in the main technical step, namely, 
the verification of a uniform integrability condition. In the present case, what is needed 
is to show that, 



E 



log n ' 



f log n I 
L n>l -ki > 



< 00. 



This is done in the following lemma. 



(8) 

□ 
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Lemma 2.2. Under the assumptions of part (iii) of the theorem, the L 1 -domination con- 
dition ([8]) holds true. 

Proof. Given a data realization x = (. . . ,x_2, x—i, Xq, xi, X2, ■ ■ •) and an m > 1, the 
recurrence time R m is denned as the first time the substring x™ appears again in the past. 
More precisely, R m is the number of steps to the left of x™ we have to look in order to 
find a copy of x™: 

R m = R m (x) = Rmix^) = ini{k >l:x? = xl£+™}- 

For any such realization x and any n > 1, if we take m = L™, then by the definitions of 
R m and L\ it follows that, R m > n, which implies, 

logi? m logn 



and thus it is always the case that, 



log n log Rr, 

sup — — < sup ■ 



n m rn 

Therefore, to establish ([8]) it suffices to prove: 



s i su P— — f<°°- (9) 



To that end, we expand this expectation as, 

logi? m -i ^ r logi? r , 



sup > < y Pi < sup > k > 

I- m m J ^ I m m J 

k>0 



< K+^Pr{sup^^>fc} 



• log 



k>Km>l 



where K is an arbitrary integer to be chosen later. Applying Markov's inequality, 

£{su P i^) < K+VVwr^. (10) 



m 

k>Km>\ 



To calculate the expectation of R m , suppose that the process X takes on a = \A\ possible 
values, so that there are a m possible strings X™ of length m. Now recall Kac's theorem 
[56] which states that E(R m \ X? = xf) = l/Vr{Xf = xf}, from which it follows that, 

E(R m ) = Y J E(R m I Xf = xf) • Pr{Xf = xf} = a m . (11) 
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Combining (fTU|) and (fTTj) yields, 




log R, 



in 



m(k— log o) 



sup 



m 



k>Km>l 




2~ (fc— log a) 



1 _ 2-0-l°ga) 



< 00, 



fc>X a 



where we choose K > logo. This establishes ([9]) and completes the proof. 



□ 



2.4 Bias and Variance of the LZ-based Estimators 

In practice, when applying the sliding-window LZ estimators H n ^ or on finite data 
strings, the values of the parameters k and n need to be chosen, so that k + n is approxi- 
mately equal to the total data length. This presents the following dilemma: Using a long 
window size n, the estimators are more likely to capture the longer-term trends in the 
data, but, as shown in [33] [45], the match-lengths L" starting at different positions i have 
large fluctuations. So a large window size n and a small number of matching positions 
k will give estimates with high variance. On the other hand, if a relatively small value 
for n is chosen and the estimate is an average over a large number of positions k, then 
the variance will be reduced at the cost of increasing the bias, since the expected value of 
/ log n is known to converge to 1/H very slowly [55] . 

Therefore n and k need to be chosen in a way such that the above bias/variance trade- 
off is balanced. From the earlier theoretical results of [33J [45J [46J [55] [57] it follows that, 
under appropriate conditions, the bias is approximately of the order 0(1/ log n), whereas 
from the central limit theorem it is easily seen that the variance is approximately of order 
0(1 /k). This indicates that the relative values of n and k should probably be chosen to 
satisfy k 0((logn) 2 ). 

Although this is a useful general guideline, we also consider the problem of empiri- 
cally evaluating the relative estimation error on particular data sets. Next we outline a 
bootstrap procedure, which gives empirical estimates of the variance H n ^ and H n ^ ] an 
analogous method was used for the estimator H n ^ in [44J, in the context of estimating the 
entropy of whale songs. 

Let L denote the sequence of match- lengths L = (L™, , Lr? > • • • > computed directly 
from the data, as in the definitions of H n ^ and H n ,k- Roughly speaking, the proposed 
procedure is carried out in three steps: First, we sample with replacement from L in 
order to obtain many pseudo-time series with the same length as L; then we compute 
new entropy estimates from each of the new sequences using H n ^k or H n ^; and finally 
we estimate the variance of the initial entropy estimates as the sample variance of the 
new estimates. The most important step is the sampling, since the elements of sequence 



11 



(L™, . . . , L£) are not independent. In order to maintain the right form of dependence, we 
adopt a version of the stationary bootstrap procedure of [33]. The basic idea is, instead 
of sampling individual L"'s from L, to sample whole blocks with random lengths. The 
choice of the distribution of their lengths is made in such a way as to guarantee that they 
are typically long enough to maintain sufficient dependence as in the original sequence. 
The results in [33] provide conditions which justify the application of this procedure. 

The details of the three steps above are as follows: First, a random position j £ 
{1, 2, . . . , k} is selected uniformly at random, and a random length T is chosen with geo- 
metric distribution with mean 1/p (the choice of p is discussed below). Then the block of 
match-lengths (L™, . . . , Lj+T-i) is copied from L, and the same process is repeated 
until the concatenation L* of the sampled blocks has length k. This gives the first boot- 
strap sample. Then the whole process is repeated to generate a total of B such blocks 
L* 1 , L* 2 , . . . , L* B , each of length k. From these we calculate new entropy estimates H*(m) 
or H*(m), for m = 1,2, ... ,B, according to the definition of the entropy estimator being 
used, as in ([3]) or ([5]) , respectively; the choice of the number B of blocks is discussed below. 
The bootstrap estimate of the variance of H is, 

m=l 

where fl = B^ 1 Ylm=i H*(m); similarly for H. 

The choice of the parameter p depends on the length of the memory of the match- 
length sequence L = (L™, LV^ , . . . , LjJ); the longer the memory, the larger the blocks need 
to be, therefore, the smaller the parameter p. In practice, p is chosen by studying the 
autocorrelogram of L, which is typically decreasing with the lag: We choose an appropriate 
cutoff threshold, take the corresponding lag to be the average block size, and choose p as 
the reciprocal of that lag. Finally, the number of blocks B is customarily chosen large 
enough so that the histogram of the bootstrap samples H*(1),H*(2), . . . ,H*(B) "looks" 
approximately Gaussian. Typical values used in applications are between B = 500 and 
B = 1000. In all our experiments in [12] [14J and in the results presented in the following 
section we set B = 1000, which, as discussed below, appears to have been sufficiently large 
for the central limit theorem to apply to within a close approximation. 

2.5 Context-Tree Weighting 

One of the fundamental ways in which the entropy rate arises as a natural quantity, is in 
the Shannon-McMillan-Breiman theorem [5]; it states that, for any stationary and ergodic 
process X = {. . . , X-i,Xq, Xi, X2, ■ ■ ■} with entropy rate H, 

logp n (X?) — > H, with prob. 1, as n — > 00, (12) 

n 

where p n (Xf) denotes the (random) probability of the random string X™. This suggests 
that, one way to estimate H from a long realization x\ of X , is to first somehow estimate 
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its probability and then use the estimated probability Pn{x\) to obtain an estimate for 
the entropy rate via, 

#n,est = --logp n (x?). (13) 

n 

The Context-Tree Weighting (CTW) algorithm [53J [54J [52J is a method, originally de- 
veloped in the context of data compression, which can be interpreted as an implementation 
of hierarchical Bayesianprocedure for estimating the probability of a string generated by 
a binary "tree process. "□ The details of the precise way in which the CTW operates can 
be found in [53j[54j|52j; here we simply give a brief overview of what (and not how) the 
CTW actually computes. In order to do that, we first need to describe tree processes. 

A binary tree process of depth D is a binary process X with a distribution defined 
in terms of a suffix set S, consisting of binary strings of length no longer than D, and a 
parameter vector 8 = (9 S ; s G S), where each 9 S G [0, 1]. The suffix set S is assumed 
to be complete and proper, which means that any semi-infinite binary string x°_ ^ has 
exactly one suffix s is S, i.e., there exists exactly one s£S such that x° ^ can be written 
as xZ^s, for some integer We write s = s(x1 oc ) G S for this unique suffix. 

Then the distribution of X is specified by defining the conditional probabilities, 

Pr{X n+1 = 1\X^= x^J = 1 - Pr{X n+1 = | X n _^ = x^} = 9.^. 

It is clear that the process just defined could be thought of simply as a D-th. order Markov 
chain, but this would ignore the important information contained in S: If a suffix string 
s G S has length £ < D, then, conditional on any past sequence xl^ which ends in s, 
the distribution of X n+ \ only depends on the most recent £ symbols. Therefore, the suffix 
set offers an economical way for describing the transition probabilities of X, especially for 
chains that can be represented with a relatively small suffix set. 

Suppose that a certain string x™ has been generated by a tree process of depth no 
greater than D, but with unknown suffix set S* and parameter vector 0* = (9* ; s G S). 
Following classical Bayesian methodology, we assign a prior probability vr(5) on each 
(complete and proper) suffix set S of depth D or less, and, given S, we assign a prior 
probability tt{6 \ S) on each parameter vector 6 = (9 S ). A Bayesian approximation to the 
true probability of x\ (under S* and 6* ) is the mixture probability, 

P D , mix (X?) = n ( S ) [ P sA*lM0 \S)dO, (14) 

s J 

where Pg eipx) is the probability of x" under the distribution of a tree process with suffix 
set S and parameter vector 0. The expression in (|14h is, in practice, impossible to compute 

3 The CTW algorithm is a general method with various extensions, which go well beyond the basic 
version described here. Some of these extensions are mentioned later in this section. 

4 The name tree process comes from the fact that the suffix set S can be represented as a binary tree. 
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directly, since the number of suffix sets of depth < D (i.e., the number of terms in the 
sum) is of order 2 D . This is obviously prohibitively large for any D beyond 20 or 30. 

The CTW algorithm is an efficient procedure for computing the mixture probability 
in ([14"]) . for a specific choice of the prior distributions n(S), tt(8 \ S): The prior on S is, 



ir(S) 



-\S\-N(S)+l 



where \S\ is the number of elements of S and N(S) is the number of string is S with 
length strictly smaller than D. Given a suffix set S, the prior on 8 is the product (i, §)- 
Dirichlet distribution, i.e., under ir(0\S) the individual 9 S are independent, with each 9 S ~ 
Dirichlet(i, \). 

The main practical advantage of the CTW algorithm is that it can actually compute 
the probability in (|14p exactly. In fact, this computation can be performed sequentially, in 
linear time in the length of the string n, and using an amount of memory which also grows 
linearly with n. This, in particular, makes it possible to consider much longer memory 
lengths D than would be possible with the plug-in method. 

The CTW Entropy Estimator. Thus motivated, given a binary string cc", we define 
the CTW entropy estimator H n ^ t ctw as, 

H n Ddw = log Pd, mix (^l), (15) 

n 

where PD,mix(£i) is the mixture probability in (|14p computed by the CTW algorithm. 
[Corresponding proceedures are similarly described in |18||26|.] The justification for this 
definition comes from the discussion leading to equation (|13p above. Clearly, if the true 
probability of is P*(x"), the estimator performs well when, 

--logP D , mix (xi) ~ --logP*(s?). (16) 
n n 

In many cases this approximation can be rigorously justified, and in certain cases it can 
actually be accurately quantified. 

Assume that x™ is generated by an unknown tree process (of depth no greater than 
D) with suffix set S* . The main theoretical result of [53J states that, for any string x™, of 
any finite length n, generated by any such process, the difference between the two terms 
in (|16p can be uniformly bounded above; from this it easily follows that, 



1 



71 



1 



5*1 315*1 + 1 



-logP^mixW)- --logP*(x?) < i^logn+ 1 1 . (17) 



2n n 



This nonasymptotic, quantitative bound, easily leads to various properties of H n ,D, ctw- 

First, ()17|) combined with the Shannon-McMillan-Breiman theorem (]12p and the point- 
wise converse source coding theorem [2] [19], readily implies that &n,d, ctw is consistent, 
that is, it converges to the true entropy rate of the underlying process, with probability 
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one, as n — > oo. Also, Shannon's source coding theorem [8] implies that the expected value 
_D,ctw cannot be smaller than the true entropy rate H; therefore, taking expectations 
in (fTTj) gives, 

0< [bias of H n , D , ctw ) <Plogn + 0(1). (18) 

In 

In view of Rissanen's |37j well-known universal lower bound, (|18p shows that the bias of 
the CTW is essentially as small as can be. Finally, for the variance, if we subtract H 
from both sides of (|17p . multiply by y/n, and apply the central-limit refinement to the 
Shannon-McMillan-Breiman theorem [58||T5]. we obtain that the standard deviation of 
the estimates H n ,D,ctw is ~ o~x/\/n, where a\ is the minimal coding variance of X [21J. 
This is also optimal, in view of the second-order coding theorem of |21j . 

Therefore, for data x\ generated by tree processes, the bias of the CTW estimator is of 
order 0(log n/n), and its variance is 0(l/n). Compared to the earlier LZ-based estimators, 
these bounds suggest much faster convergence, and are in fact optimal. In particular, the 
0(logn/n) bound on the bias indicates that, unlike the LZ-based estimators, the CTW 
can give useful results even on small data sets. 

Extensions. An important issue for the performance of the CTW entropy estimator, 
especially when used on data with potentially long-range dependence, is the choice of the 
depth D: While larger values of D give the estimator a chance to capture longer-term 
trends, we then pay a price in the algorithm's complexity and in the estimation bias. This 
issue will not be discussed further here; a more detailed discussion of this point along with 
experimental results can be found in |12j|14j. 

The CTW algorithm has also been extended beyond finite-memory processes [52]. The 
basic method is modified to produce an estimated probability P^^Xi), without assuming a 
predetermined maximal suffix depth D. The sequential nature of the computation remains 
exactly the same, leading to a corresponding entropy estimator defined analogously to 
the one in (fl5l) . as H n ,oo,ctw = — ^ log Poo(x\ )• Again it is easy to show that H n ,oo,ctw 
is consistent with probability one, this time for every stationary and ergodic (binary) 
process. The price of this generalization is that the earlier estimates for the bias and 
variance no longer apply, although they do remain valid is the data actually come from 
a finite-memory process. In numerous simulation experiments we found that there is no 
significant advantage in using H n ,D, ctw with a finite depth D over H n ,oo,ctw, except for 
the somewhat shorter computation time. For that reason, in all the experimental results 
in Sections 13.21 and 1 3.31 below, we only report estimates obtained by H n ,oc,ctw 

Finally, perhaps the most striking feature of the CTW algorithm is that it can be 
modified to compute the "best" suffix set S that can be fitted to a given data string, 
where, following standard statistical (Bayesian) practice, "best" here means the one which 
is most likely under the posterior distribution. To be precise, recall the prior distributions 
7r(5) and ir(0 \ S) on suffix sets S and on parameter vectors 6, respectively. Using Bayes' 
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rule, the posterior distribution on suffix sets S can be expressed, 

fP St0 (x?MO\S)M 



Pr{5 | xj} 



Pd, 



the suffix set S = S(x™) which maximizes this probability is called the Maximum A pos- 
teriori Probability, or MAP, suffix set. Although the exact computation of S is, typically, 
prohibitively hard to carry out directly, the Context-Tree Maximizing (CTM) algorithm 
proposed in [50j is an efficient procedure (with complexity and memory requirements es- 
sentially identical to the CTW algorithm) for computing S. The CTM algorithm will not 
be used or discussed further in this work; see the discussion in [12] [14] , where it plays an 
important part in the analysis of neuronal data. 



3 Results on Simulated Data 

This section contains the results of an extensive simulation study, comparing various as- 
pects of the behavior of the different entropy estimators presented earlier (the plug-in 
estimator, the four LZ-based estimators, and the CTW estimator), applied to different 
kinds of simulated binary data. Motivated, in part, by applications in neuroscience, we 
also introduce a new method, the renewal entropy estimator. Section T3. II contains descrip- 
tions of the statistical models used to generate the data, along with exact formulas or 
close approximations for their entropy rates. 

Sections 13.21 and 13.31 contain the actual simulation results. 



3.1 Statistical Models and Their Entropy Rates 
3.1.1 I.I.D. (or "homogeneous Poisson") Data 

The simplest model of a binary random process X is as a sequence of independent and 
identically distributed (i.i.d.) random variables {X n }, where the the Xi are independent 
and all have the same distribution. This process has no memory, and in the neuroscience 
literature it is often referred to as a "homogeneous Poisson process." The distribution of 
each Xi is described by a parameter p £ [0, 1], so that Pr{Aj = 1} = p and Pr{A"i = 0} = 
1 — p. If {A n } were to represent a spike train, then p = E{Xi) would be its average firing 
rate. 

The entropy rate of this process is simply, 

H = H(Xi) = —plogp — (1 — p) log(l — p). 



3.1.2 Markov Chains 

An £-th order (homogeneous) Markov chain with (finite) alphabet A is a process X = {Xn} 
with the property that, 

Pr{A n = x n | Xr 1 = xr 1 } = Pr{X n = x n \ X^j = = P„«- W 
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for all Xt 1 G A n , where P = (P^e „ ) is the transition matrix of the chain. This formalizes 
the idea that the memory of the process has length I: The probability of each new symbol 
depends only on the most recent I symbols, and it is conditionally independent of the 
more distant past. The distribution of this process is described by the initial distribution 
of its first I symbols, (ir(x{)), and the transition matrix P. The entropy rate of an ergodic 
(i.e., irreducible and aperiodic) £-th order Markov chain X is given by, 

H = - V 7T*(4) V P** r lo S P ^ r , 

x{ x e+1 

where tt* is the unique stationary distribution of the chain. 
3.1.3 Hidden Markov Models 

Next we consider a class of binary processes X = {X n }, called hidden Markov models 
(HMMs) or hidden Markov processes, which typically have infinite memory. For the 
purposes of this discussion, a binary HMM can be defined as follows. Suppose Y = {Y n } 
is a first-order, ergodic Markov chain, which is stationary, i.e., its initial distribution tt is 
the same as its stationary distribution tt* . Let the alphabet of Y be an arbitrary finite 
set A, write P = {P yy >) for its transition matrix, and let Q = (Q yx \ y G A, x G {0, 1}) 
be a different transition matrix, from A to {0,1}. Then, for each n, given {Yi} and the 
previous values of X™^ 1 , the distribution of the random variable X n is, 

Pr{X n = x | Y n = y} = Q yx , 

independently of the remaining {Yi} and of X™" 1 . The resulting process X = {X n } is a 
binary HMM. 

The consideration of HMMs here is partly motivated by the desire to simulate spike 
trains with slowly varying rates, as in the case of real neuronal firing. To illustrate, 
consider the following (somewhat oversimplified) description of a model that will be used 
in the simulation examples below. Let the Markov chain Y = {Y n } represent the process 
which modulates the firing rate of the binary "spike train" X, so that Y takes a finite 
number of values, A = {r±, V2, ■ ■ ■ , r a }, with each rj G (0,1). These values correspond 
to a different firing regimes, so that, e.g., Y n = r\ means that the average firing rate 
at that instant is r\ spikes-per-time-unit. To ensure that the firing rate varies slowly, 
define, for every y G A, the transition probability that Y n remains in the same state to 
be PrjYn = r \ Y n -\ = r} = 1 — e, for some small e > 0. Then, conditional on {1^}, the 
distribution of each X n is given by Pr{X n = l\Y n = r} = r = 1 — Pr{X n = | Y n = y}. 

In general, an HMM defined as above is stationary, ergodic and typically has infinite 
memory - it is not a £-th order Markov chain for any £; see |10] and the references therein 
for details. Moreover, there is no closed-form expression for the entropy rate of a general 
HMM, but, as outlined below, it is fairly easy to obtain an accurate approximation when 
the distribution of the HMM is known a priori, via the Shannon-McMillan-Breiman the- 
orem (|12p . That is, the value of the entropy rate H = H{X) can be estimated accurately 
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as long as it is possible to get a close approximation for the probability p n (Xf) of a long 
random sample X". This calculation is, in principle, hard to perform, since it requires the 
computation of an average over all possible state sequences y™, and their number grows 
exponentially with n. As it turns out, adapting an idea similar to the usual dynamic 
programming algorithm used for HMM state estimation, the required probability p n (X™) 
can actually be computed very efficiently; similar techniques appear in various places in 
the literature, e.g., [TO][I7]. Here we adopt the following method, developed independently 
in [12]. 

First, generate and fix a long random realization x™ of the HMM X, and define the 
matrices AfW by, 

M y\) = p yy'Qy'x k , y,v' ^ A k = 2,3,...,n, 
and the row vector b = (b y ), 

hy = n(y)Qyxi, y G A - 

The following proposition says that the probability of an arbitrary a?" can be obtained in n 
matrix multiplications, involving matrices of dimension \ A\ x \A\. For moderate alphabet 
sizes, this can be easily carried out even for large n, e.g., on the order of 10 6 . Moreover, 
as the HMM process X inherits the strong mixing properties of the underlying Markov 
chain Y, Ibragimov's central- limit refinement [16] to the Shannon-McMillan-Breiman the- 
orem suggests that the variance of the estimates obtained should decay at the rapid rate 
of 1/re. Therefore, in practice, it should be possible to efficiently obtain a reliable, sta- 
ble approximation for the entropy rate, a prediction we have repeatedly verified through 
simulation H 

Proposition 3.1. Under the above assumptions, the probability of an arbitrary x™ G 
{0, l} n produced by the HMM X can be expressed as, 



Pn{x n l ) = b[\[M^ 

where 1 is the column vector of \A\ Is. 



k=2 



5 To avoid confusion note that, although this method gives a very accurate estimate of the entropy rate 
of an HMM, in order to carry it out it is necessary to know in advance the distributions of both the HMM 
X and of the unobservable process Y. 
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Proof: Let G ^4™ denote any specific realization of the hidden Markov process YJ 1 . We 
have, 

p n (x?) = Pr{Xr = *?} = ^ Pr{Yl™ = y?} Pr{Xf = xj | Y™ = tf} 

n 

= ^(y^Qyixi n Pyk-i,ykQykXk 

y^£A n k=2 

n 

n 

= £(»[n «"*]), 

y„SA fc=2 
n 

fc=2 

□ 

3.1.4 Renewal Processes 

A common alternative mathematical description for the distribution of a binary string is 
via the distribution of the time intervals between successive Is, or, in the case of a binary 
spike train, the interspike intervals (ISIs) as they are often called in the neuroscience 
literature [32] Specifically, let {ti} denote the sequence of times t when Xt = 1, and let 
{Yj = t{ + i — ti} be the sequence of "interarrival times" or ISIs of X = {X n }. Instead of 
defining the joint distributions of blocks X" of random variables from X, its distribution 
can be specified by that of the process Y = {Yi}. For example, if the Yj are i.i.d. random 
variables with geometric distribution with parameter p £ [0, 1] , then X is itself a (binary) 
i.i.d. process with parameter p. 

More generally, a renewal process X is defined in terms of an arbitrary i.i.d. ISI process 
Y, with each Yj having a common discrete distribution P = (pj ; j = 1, 2, . . .)@ 

Recall [ 32] that the entropy rates H(X) of X and H(Y) of Y are related by, 

oo 

H(X) = XH{Y) = -A^ Pi log Pi , 

i=i 

where A = E(X\) = 1/E(Y\). This simple relation motivates the consideration of a 
different estimator for the entropy rate of a renewal process. Given binary data re" of 

6 The consideration of renewal processes in partly motivated by the fact that, as discussed in |14j |12 | . the 
main feature of the distribution of real neuronal data that gets captured by the CTW algorithm (and by 
the MAP suffix set it produces) is an empirical estimate of the distribution of the underlying ISI process. 
Also, simulations suggest that renewal processes produce firing patterns similar to those observed in real 
neurons firing, indicating that the corresponding entropy estimation results are more biologically relevant. 
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length n, calculate the corresponding sequence of ISIs y™, and define, 

-ffn,renewal = — A ^ <Jj log (jj , 

j 

where Q = (q~j) is the empirical distribution of the ISIs y™, and A is the empirical rate of 
X , i.e., the proportion of Is in x™. We call this the renewal entropy rate estimator. 

When the data are indeed generated by a renewal process, the law of large numbers 
guarantees that -ff n ,renewai will converge to H(X), with probability one, as n — > oo. More- 
over, under rather weak regularity conditions on the distribution of the ISIs, the central 
limit theorem implies that the variance of these estimates decays like 0(1/ n). But the 
results of the renewal entropy estimator remain meaningful even if X is not a renewal 
process. For example, if the corresponding ISI process Y is stationary and ergodic but 
not i.i.d., then the renewal entropy estimator will converge to the value XH(Yi), whereas 
the true entropy rate in this case is the (smaller) quantity, 

H{Y) = lim XH {Yi | Y^ u . . . , Y 2 , Y x ). 

i— >oo 

Therefore, the renewal entropy estimator can be employed to test for the presence or 
absence of renewal structure in particular data sets, by comparing the value of -ff n ,renewai 
with that of other estimators; see [14||12| for a detailed such study. 

3.2 Bias and Variance of the CTW and the LZ-based Estimators 

In order to compare the empirical bias and variance of the four LZ-based estimators and the 
CTW estimator with the corresponding theoretical predictions of Sections 12.41 and !2.5l the 
five estimators are applied to simulated (binary) data. [In this as well as in the following 
section, we only present results for the z/i/z/iztc-suffix-depth CTW estimator -f/n,oo,ctwi 
recall the discussion at the very end Section 12.51 ] The simulated data were generated from 
four different processes: An i.i.d. (or "homogeneous Poisson") process, and three different 
Markov chains with orders £ = 1,2, and 10. For each parameter setting of each model, 
50 independent realizations were generated and the bias of each method was estimated 
by subtracting the true entropy rate from the empirical mean of the individual estimates. 
Similarly, the standard error was estimated by the empirical standard deviation of these 
results. 

Specifically, for and H n ^, first a window size n and a number of matches k were 
selected, and then 50 independent realizations of length iV = 2n + k were generated. 
The bias and standard error are plotted in Figure Q] against 0(1/ log n) and 0(l/yk), 
respectively. The approximately linear curves confirm the theoretical predictions that the 
bias and variance decay to zero at rates 0(1/ log n) and 0(l/k), respectively. Note that, 
although H n ^ is always larger than H n ^, there is no systematic trend regarding which one 
gives more accurate estimates. Also plotted on Figure [1] is the bias of infinite-suffix-depth 
CTW estimator, HN,oo,ctw, applied to data with total length N = 2n + k. [The estimated 
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Figure 1: Results obtained by H n k (shown as red lines with circles), by H n ^ (blue lines 
with triangles), and by the CTW estimator Hjv,oo,ctw (green lines with squares), applied to 
data from four different processes. For H n f. and H n the first row shows the bias plotted 
against 1/logra, for window lengths n = 10 3 , 10 4 , 10 5 , 10 6 , and with a fixed number of 
matches k = 10 6 ; the second row shows the standard error plotted against l/\/fc, where 
k = 10 4 , 10 5 , 10 6 , with fixed n = 10 6 . For H N 

oo,ctw; the first row shows the bias when 
applied to data of the same total length N = n + 2k; for its standard error see Figure El 
The true values of the entropy rates of the four processes are H = 0.1414, H = 0.4971, 
H = 0.7479 and H = 0.6946, respectively. 



standard error of the CTW estimator is not shown in Figure [H because it cannot be 
compared to the behavior of the LZ-based estimators as the number of matches k varies; 
see Figure [2] for estimates of the CTW standard error on the same set of experiments.] 

The results of the CTW are generally much more accurate than those of the LZ esti- 
mators, except in the case of the 10th order Markov chain with small data size, where the 
LZ-based methods seem to get better results. 

The values of the bias and standard error of H n ,k and H n ,k in Figure CD suggest that, 
in order to minimize the total mean squared error (MSE) of the LZ-based estimates, the 
number of matches k should be chosen to be small relative to the window length n, since 
the variance clearly appears to decay much faster than the square of the bias. This is 
further confirmed by the results shown in Table (U which shows corresponding estimates 
for an i.i.d. process with p = 0.25 and data length N = 10 6 . The values of n and k satisfy 
n + k = N — 2 log N. The ratio n/k ranges from 1 to 10,000. 

Next, the validity of the bootstrap procedure for the standard error of the LZ estimators 
H n k and H n k is examined; recall the description in Section 12.41 For data generated from 
the same four processes, the bootstrap estimate of the standard error is compared to that 
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H n ,k 


n/k 


n 


k 


bias 


std err 


MSE 


bias 


std err 


MSE 


1 


499980 


499980 


-0.0604 


0.0010 


0.1824 


-0.0325 


0.0009 


0.0528 


10 


909054 


90906 


-0.0584 


0.0018 


0.1705 


-0.0318 


0.0019 


0.0507 


100 


990059 


9901 


-0.0578 


0.0066 


0.1692 


-0.0315 


0.0067 


0.0517 


1000 


998961 


999 


-0.0553 


0.0200 


0.1732 


-0.0297 


0.0210 


0.0663 


10000 


999860 


100 


-0.0563 


0.0570 


0.3209 


-0.0356 


0.0574 


0.2280 



Table 1: Choosing n and k to minimize MSE. Data are generated from an i.i.d. process 
with p = 0.25, the true entropy rate is H = 0.8113, the data length N = 10 6 , and 
n + k = N - 2 log N. 

obtained empirically from the sample standard deviation computed from 50 independent 
repetitions of the same experiment. Since that the natural domain of applicability of the 
bootstrap method is for large values of k, Tables [2] and [3] contain simulation results for 
k = 10 5 and k = 10 6 , with n = 10 3 . The results clearly indicate that the bootstrap 
procedure indeed gives accurate estimates of the standard error. 







H 


i,k 






k = 10 5 


k = 10 e 


case 


bootstrap 


std dev 


bootstrap 


std dev 


1 


0.0018 


0.0025 


0.0006 


0.0009 


2 


0.0025 


0.0023 


0.0008 


0.0009 


3 


0.0015 


0.0019 


0.0005 


0.0007 


4 


0.0061 


0.0073 


0.0017 


0.0023 



Table 2: Comparison between the bootstrap standard error and the empirical estimate of 
the standard deviation for the four different processes. The window size n is fixed at 10 3 . 

Corresponding experiments were performed for H n and H n , applied to data from the 
same four types of processes. Although for these estimators there is little in the way 
of rigorous theory that can be used as a guideline to compute their mean and variance, 
simple heuristic calculations strongly suggest that they should decay like 0(1/ log n) and 
0(1/ n), respectively. Figure shows the bias and standard error computed empirically 
as for H n k and H n ^, and plotted against 1/logn and 1/^fn, respectively. Once again, 
the fact that the resulting empirical curves are approximately linear agrees with these 
predictions. Observe that the bias of H n can be either positive or negative; the same 
behavior was observed for H n in numerous simulation experiments. 

The fact that the standard error converges much faster than the bias strongly suggests 
that, for all four LZ-based estimators, it is the bias that dominates the estimation error, 
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H 


i,k 






k = 10 5 


k = 10 d 


case 


bootstrap 


std dev 


bootstrap 


std dev 


1 


0.0033 


0.0033 


0.0011 


0.0010 


2 


0.0031 


0.0025 


0.0010 


0.0009 


3 


0.0017 


0.0016 


0.0006 


0.0006 


4 


0.0032 


0.0031 


0.0009 


0.0010 



Table 3: Comparison between the bootstrap standard error and the empirical estimate of 
the standard deviation for the four different processes. The window size n is fixed at 10 3 . 



even in these simple cases of processes with fairly short memory. 

Figure [2] also shows the bias and standard error of trie CTW estimator H n ^ ctw 

Its 

bias appears to generally converge significantly faster than that of the LZ-based methods, 
while its standard error is very close to that of H n and H n , apparently converging to zero 
at a very similar rate. Nevertheless, the results show that the main source of error of the 
CTW is again from the bias. 

Finally we note that, in the results shown, as well as in numerous other simulation runs, 
we found that the two increasing-window LZ estimators H n and H n generally performed 
better, and always at least as well, as the corresponding sliding-window estimators H n ^ 
and H n ^. Specifically, the simulation results show that the optimal choice of parameters 
for H n ^ and H n ,k leads to estimates whose bias is very similar to that of H n and H n , 
whereas the increasing-window estimators have significantly smaller variance. 

3.3 Comparison of the Different Entropy Estimators 

This section contains a systematic comparison of the performance of all the estimators 
introduced so far: The plug-in method, the LZ-based estimators, the CTW algorithm, 
and the renewal entropy estimator. All the estimators are applied to different types of 
simulated binary data, generated from processes with different degrees of dependence and 
memory. The main figure of merit adopted here is the ratio xMp^ between the square- 
root of the mean square error (MSE), and the true entropy rate. The corresponding ratios 
of the bias to the entropy and of the standard error to the entropy are also considered. 

Since, as noted earlier, the two increasing- window LZ estimators H n , H n generally 
perform better or at least as well as their sliding-window counterparts, H n ^, H n ^, from 
now on we restrict attention to H n and H n . 

Table H] shows the results obtained by the plug- in for word-lengths w = 15 and w = 20, 
the LZ-estimators H n and H n , and the CTW estimator -ff n ,oo,ctw, on data generated from 
the same four finite-memory processes as above: An i.i.d. process and three Markov chains 
of order £ = 1,2 and 10. Again we observe that the main source of error is from the bias for 
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Poisson,p=0.02 first order MC 2nd order MC 1 0-th order MC 




2 4 6 8 10 12 14 2 4 6 8 10 12 14 2 4 6 8 10 12 14 2 4 6 8 10 12 14 



x 10" 3 



Figure 2: Results obtained by H n (shown as red lines with circles), by H n (blue lines with 
triangles), and by the CTW estimator (green lines with squares), applied to data from four 
different processes. For H n and H n , the first row shows the bias plotted against 1/logn, 
for window lengths n = 5000, 10 4 , 10 5 , 10 6 , and the second row shows the standard error 
plotted against l/^/n. Similarly, for H n ,oo,ctw, the two rows show the bias and standard 
error when the estimator is applied to data of the same total length. The true values of 
the entropy rates of the four processes are the same as before. 



all five estimators. The CTW appears to have the smallest bias, while the standard error 
varies little among the different estimators. More specifically, the results demonstrate that, 
for short memory processes, the plug-in estimates are often better than those obtained by 
the LZ estimators, whereas for the 10-th order Markov chain the plug-in with word-length 
w = 20 is much worse than H n and H n because of the undersampling problem mentioned 
earlier. The CTW estimator performs uniformly better than all the other estimators, for 
both short and relatively long memory processes. Its fast convergence rate outperforms 
the LZ-based estimators, and its ability to look much further into the past makes it much 
more accurate than the plug-in. 

Table [5] shows corresponding results for data generated by two different binary HMMs; 
recall the relevant description from Section f3.1.3l The first one has three (hidden) states, 
A = {ri,r2,rs}, where t\ = 0.005, T2 = 0.02, r% = 0.05. The transition matrix of Y 
has PrjYn+i = r\Y n = r} = 1 — e, and Prjl^+i = r' \ Y n = r} = e/2, for any r and 
r' 7^ r, where e = 0.001. Given the sequence Y = {Y n } of hidden states, the observations 
X = {Xn} are conditionally independent samples, where each X n is a binary random 
variable with Vi{X n = l\Y n = r} = 1 — Pr{X n = Q\Y n = r} = r. In the second example, 
the hidden process Y = {Y n } takes values in the set A of 50 real numbers evenly spaced 
between 0.001 and 0.1. The evolution of the process Y = {Y n } is that of a nearest- 
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Data model 




u> = 15 


w = 20 




ffyi 


CTW 


i.i.d. 


% of bias 


0.001 


-0.10 


-14.47 


9.98 


0.04 




% of stdprr 


0.57 


0.51 


0.77 


0.83 


0.51 




% of yjMSE 


0.57 


0.52 


14.49 


10.01 


0.52 


1st order MC 


% of bias 


-0.11 


-0.78 


-10.38 


0.70 


0.02 




% of stderr 


0.22 


0.16 


0.15 


0.12 


0.21 




% of Vmse 


0.25 


0.80 


10.38 


0.71 


0.21 


2nd order MC 


% of bias 


4.16 


1.72 


-5.32 


-1.56 


0.02 




% of stderr 


0.08 


0.08 


0.04 


0.04 


0.09 




% of Vmse 


4.16 


1.73 


5.32 


1.56 


0.09 


10th order MC 


% of bias 


16.04 


10.03 


-2.66 


6.23 


0.77 




% of stderr 


0.19 


0.14 


0.13 


0.12 


0.16 




% of Vmse 


16.04 


10.04 


2.66 


6.24 


0.78 



Table 4: Comparison of the ratios between the bias, the standard error and the square-root 
of the MSE over the true entropy rate. Five estimators are used on four different types of 
data. All estimators are applied to data of the same length n = 10 6 . Results are shown 
as percentages, that is, all ratios are multiplied by 100. 



neighbor random walk; Y n stays constant with probability (1 — e) and it moves to either 
one of its two neighbors with probability e/2, with e = 0.02. The conditional distribution 
of X given Y is the same as before. 

In both cases, the "true" entropy of the underlying HMM was computed using the ap- 
proximation procedure described earlier. In the first example, 10 independent realizations 
of the HMM were used according to the formula of Proposition 13. 1\ and the average of 
the resulting estimates was taken to be the true entropy rate in the calculations of the 
bias and standard error in Table El The same procedure was applied to three independent 
realizations of the second example. 

The results on HMM data are quite similar to those obtained on processes with short 
memory shown in Tabled! The LZ-based estimators have heavy biases, whereas both the 
plug-in and the CTW method give very accurate estimates. This is probably due to the 
fact that, although HMMs in general have infinite memory, the memory of an HMM with 
a finite number of states decays exponentially, therefore only the short-range statistical 
dependence in the data is significant. 

To simulate binary data strings with potentially longer memory (and with character- 
istics that are generally closer to real neuronal spike trains), we turn to renewal processes 
X = {X n } with ISIs Y = {Yj} that are distributed according to a mixture of discretized 
Gamma distributions; recall the description of Section 13.1.41 Here the Y{ are taken to be 
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plug-in 


plug-in 








EMM 




w = 15 


io = 20 


H n 


H n 


CTW 


3 states 


% of bias 


4.05 


3.69 


-43.41 


11.46 


2.51 




% of stderr 


2.43 


2.43 


1.79 


2.64 


2.41 




% of Vmse 


4.74 


4.43 


43.47 


11.75 


3.50 


50 states 


% of bias 


2.98 


2.53 


-35.62 


5.39 


2.31 




% of stderr 


3.26 


3.25 


3.15 


3.35 


3.26 




% of Vmse 


4.43 


4.12 


35.76 


6.33 


4.00 



Table 5: Comparison of the ratios between the bias, the standard error and the square- 
root of the MSE over the true entropy rate. Five estimators are used on data generated 
from two different HMMs. All estimators are applied to data of the same length n = 10 6 . 
Results are shown as percentages, that is, all ratios are multiplied by 100. 



i.i.d. with distribution P = (pj) given by, 

Pj = M/l(i) + (1 - P)h(j), 3 = 1,2,3, ... , 

where \x € (0, 1) is the mixing proportion and each /j is a (discretized) Gamma density 
with parameters (aj,/3j), respectively. 

If the ISI distribution P was simply Gamma(a,/3), then the rate E{X\) of the binary 
process X would be the reciprocal of E(Y\) = af3; therefore, taking a mixture of two 
Gamma densities, one with a(5 small and one with af3 large, gives an approximate model of 
a "bursty" neuron, that is, a neuron which sometimes fires several times in rapid succession, 
and sometimes does not fire for a long period. Figure [3] shows the result of such a simulated 
process X . The empirical ISI density resembles a real neuron's ISI distribution, and the 
peak near the zero lag in the autocorrelogram of {AT n } indicates the bursty behavior. 

Table [6] shows the results of the estimates obtained by the renewal entropy estimator, 
the plug-in with word-length w = 20, the two increasing-window LZ-based estimators, 
and the CTW method, applied to data generated by a binary renewal process. The ISI 
distribution P is a mixture of two (discrete) Gamma densities /i and /2, where fx has 
fixed parameters {a\,f3\) = (2, 10) that represent the bursting regime, and fa takes three 
different sets of (much larger) parameters (02,^2), representing low frequency firing. 

The CTW and renewal estimator consistently outperform the other methods, and, 
on the other extreme, the LZ-based estimators have high biases and give relatively poor 
results. The plug-in estimator only considers what happens in 20-bit-long windows, and 
therefore it misses all the data features beyond this range. Especially in the case of large 
parameters (0:2, $2) where the ISI distribution has heavier tails, the structure and the 
dependence in the data becomes significantly longer in range. Also, in that regime, the 
resulting number of ISI data points y% is much smaller, and therefore, the empirical ISI 
distribution is less accurate; as a result, the renewal entropy estimator also becomes less 
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Figure 3: Simulated binary renewal process with characteristics similar to those of a bursty 
neuron. The ISI distribution is given by a mixture of discrete Gammas, yjj/(2,io)+ 1^/(50,50) • 
The "rate" of the process is E(X 1 ) = 0.00398. The first plot shows the first 10000 simulated 
bits of X; the second plot shows the empirical ISI distribution; the last plot shows the 
autocorrelogram of X for values of the lag between and 500. 

accurate. The situation is similar for the CTW. As shown in [12] [13], the CTW method also 
essentially approximates the empirical ISI distribution (although it does so in a different, 
more efficient way than the renewal entropy estimator), and for larger values of (a?,, fo) 
its estimates become less accurate, in a way analogous to the renewal entropy estimator. 
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renewal 


nluff-in 












entropy 


w = 20 


H n 


H n 


CTW 


(10,20) 


% of bias 


-0.06 


6.09 


-20.98 


21.84 


1.66 




% of stderr 


0.73 


0.77 


0.66 


0.95 


0.72 




% of Vmse 


0.74 


6.14 


20.99 


21.86 


1.81 


(50,20) 


% of bias 


-1.53 


25.90 


-30.38 


81.40 


7.64 




% of stderr 


2.37 


3.03 


0.79 


2.67 


2.38 




% of Vmse 


2.82 


26.08 


30.39 


81.45 


8.00 


(50,50) 


% of bias 


-5.32 


34.35 


-50.64 


85.61 


3.58 




% of stderr 


2.36 


3.12 


0.69 


2.67 


2.42 




% of Vmse 


5.82 


34.49 


50.65 


85.65 


4.32 



Table 6: Comparison of the ratios between the bias, the standard error and the square-root 
of the MSE over the true entropy rate. Five estimators are used on three different data 
sets generated by a renewal process whose ISI distribution P is a mixture of Gammas, 
P = m/(2,io) + (1 — / u )/(o2,/?2)- The mixing parameter [i = 0.8 in the first two cases, and 
[i = 0.9 in the last case. All estimators are applied to data of the same total length 
n = 10 6 . Results are shown as percentages, that is, all ratios are multiplied by 100. 

4 Summary and Concluding Remarks 

A systematic and extensive comparison between several of the most commonly used and 
effective entropy estimators for binary time series was presented. Those were the plug-in 
or "maximum likelihood" estimator, four different LZ-based estimators, the CTW method, 
and the renewal entropy estimator. 

Methodology. Three new entropy estimators were introduced; two new LZ-based estima- 
tors, and the renewal entropy estimator, which is tailored to data generated by a binary 
renewal process. A bootstrap procedure, similar to the one employed in [44], was de- 
scribed, for evaluating the standard error of the two sliding-window LZ estimators H n ^ 
and H n ^. Also, for these two estimators, a practical rule of thumb was heuristically 
derived for selecting the values of the parameters n and k in practice. 

Theory. It was shown (Theorem 12. ip that the two new LZ-based estimators H n ^ and H n 
are universally consistent, that is, they converge to the entropy rate for every finite- valued, 
stationary and ergodic process. Unlike the corresponding estimators H n ,k and H n of [22J, 
no additional conditions are required. An effective method was derived (Proposition 13. 1| ) 
for the accurate approximation of the entropy rate of a finite-state HMM with known 
distribution. Heuristic calculations were presented and approximate formulas derived for 
evaluating the bias and the standard error of each estimator. 

Simulation. Several general conclusions can be drawn from the simulation experiments 
conducted. 
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(i) For all estimators considered, the main source of error is the bias. 

(ii) Among all the different estimators, the CTW method is the most effective; 
it was repeatedly and consistently seen to provide the most accurate and 
reliable results. 

(iii) No significant benefit is derived from using the finite-context-depth version 
of the CTW. 

(iv) Among the four LZ-based estimators, the two most efficient ones are those 
with increasing window sizes, H n , H n . No systematic trend was observed 
regarding which one of the two is more accurate. 

(v) Interestingly (and somewhat surprisingly), in many of our experiments the 
performance of the LZ-based estimators was quite similar to that of the 
plug-in method. 

(vi) The main drawback of the plug-in method is its computational inefficiency; 
with small word-lengths it fails to detect longer-range structure in the data, 
and with longer word-lengths the empirical distribution is severely under- 
sampled, leading to large biases. 

(vii) The renewal entropy estimator, which is only consistent for data gener- 
ated by a renewal process, suffers a drawback similar (although perhaps less 
severe) to the plug-in. 

In closing we note that much of the work reported here was done as part of the 
first author's Ph.D. thesis [12]. Several new estimators and results have appeared in the 
literature since then, perhaps most notably in [6]. There, a different entropy estimator is 
introduced based on the Burrows- Wheeler transform (BWT), it is shown to be consistent 
for all stationary and ergodic processes, and bounds on its convergence rate are obtained. 
Moreover, simulation experiments on binary data indicate that it significantly outperforms 
the plug-in estimator as well as a modified version of the LZ-based estimator H n of [22]. In 
view of the present results, interesting further work would include a detailed comparison 
of the BWT estimator of [6] with the CTW method. 
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